Collaborative tagging has been quickly gaining ground because of its ability to recruit the activity
of web users into effectively organizing and sharing vast amounts of information. Here we collect 
data from a popular system and investigate the statistical properties of tag co-occurrence. We 
introduce a stochastic model of user behavior embodying two main aspects of collaborative tagging: 
(i) a frequency-bias mechanism related to the idea that users are exposed to each other's tagging 
activity; (ii) a notion of memory - or aging of resources - in the form of a heavy-tailed access to 
the past state of the system. Remarkably, our simple model accounts quantitatively for
the observed experimental features with surprisingly high accuracy. This points in the direction
of a universal behavior of users, who - despite the complexity of their own cognitive processes and 
the uncoordinated and selfish nature of their tagging activity - appear to follow simple activity 
patterns. 
FIG. 1: Schematic depiction of the collaborative tagging process: web users are exposed to a
resource and freely associate tags with it. Their interaction with the system also exposes
them to tags previously entered by themselves and by other users. The aggregated activity of
users leads to an emergent categorization of resources in terms of tags shared by a community.



1. INTRODUCTION 

Recently, a new paradigm has been quickly gaining
ground on the World-Wide Web: Collaborative Tagging.
In web applications like del.icio.us¹,
Flickr², CiteULike³ and Connotea⁴, users manage, share
and browse collections of online resources by enriching 
them with semantically meaningful information in the 
form of freely chosen text labels (tags). The paradigm 
of collaborative tagging has been successfully deployed in 
web applications designed to organize and share diverse 
online resources such as bookmarks, digital photographs, 
academic papers, music and more. Web users interact 
with a collaborative tagging system by posting content 
(resources) into the system, and associating text strings 
(tags) with that content, as shown in Fig. 1. At the
global level the set of tags, though determined with no 
explicit coordination, evolves in time and leads towards 
patterns of terminology usage that are shared by the en- 
tire user community. Hence one observes the emergence 
of a loose categorization system - commonly referred to 
as folksonomy - that can be effectively used to navigate 
through a large and heterogeneous body of resources. 

Focusing on tags as basic dynamical entities, the process of collaborative tagging falls
within the scope of Semiotic Dynamics, a new field that studies how populations of humans
or agents can establish and share semiotic systems, typically driven by their use in
communication. Indeed the emergence of a folksonomy exhibits dynamical aspects also observed
in human languages, such as the crystallization of naming conventions, competition between
terms, takeovers by neologisms, and more.

*Electronic address: ciro.cattuto@roma1.infn.it

1 http://del.icio.us

2 http://www.flickr.com

3 http://www.citeulike.org

4 http://www.connotea.org

In the following we adopt the point of view of complex 
systems science and try to understand how the "micro- 
scopic" tagging activity of users causes the emergence of 
the high-level features we observe for the ensuing folk- 
sonomy. We ground our analysis on actual tagging data 
extracted from del.icio.us and Connotea and use stan- 
dard statistical tools to gain insights into the underlying 
tagging dynamics. Based on this, we introduce a simple 
stochastic model for the tagging behavior of an "average" 
user, and show that such a model - despite its simplic- 
ity - is able to reproduce extremely well some of the 
observed properties. We close by giving an interpretation
of the model parameters and pointing out directions for
future research.



2. EXPERIMENTAL DATA 

The activity of users interacting with a collaborative tagging system consists of either
navigating the existing body of resources by using tags, or adding new resources to the
system. In order to add a new resource to the system, the user is prompted for a reference
to the resource and a set of tags to associate with it. Thus the basic unit of information
in a collaborative tagging system is a (user, resource, {tags}) triple, here referred to as
a post. Tagging events build a tri-partite graph (with partitions corresponding to users,
resources and tags, respectively), and this graph can subsequently be used as a navigation
aid in browsing tagged information. Usually a post also contains a temporal marker
indicating the physical time of the tagging event, so that temporal ordering can be
preserved in storing and retrieving posts.
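
As a rough illustration of the post triple and of tag-based navigation, the following Python sketch stores posts and indexes them by tag in temporal order. The `Post` class, its field names, the integer time stamps and the sample data are all hypothetical; they are not taken from del.icio.us or from the analysis reported here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """One tagging event: a (user, resource, {tags}) triple plus a time stamp."""
    user: str
    resource: str
    tags: frozenset
    timestamp: int  # any sortable time marker (hypothetical unit)


def index_by_tag(posts):
    """Map each tag to the posts carrying it, ordered from oldest to newest."""
    index = defaultdict(list)
    for post in sorted(posts, key=lambda p: p.timestamp):
        for tag in post.tags:
            index[tag].append(post)
    return dict(index)


# Hypothetical sample data, for illustration only.
posts = [
    Post("alice", "http://example.org/a", frozenset({"blog", "design"}), 2),
    Post("bob", "http://example.org/b", frozenset({"blog", "ajax"}), 1),
    Post("carol", "http://example.org/a", frozenset({"design"}), 3),
]
by_tag = index_by_tag(posts)
print([p.user for p in by_tag["blog"]])  # users who applied "blog", oldest first
```

Keeping the time stamp inside each post is what allows the time-ordered tag streams used later in the analysis to be rebuilt from stored data.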

Our analysis will focus on del.icio.us, for several reasons: i) it was the very first system
to deploy the ideas and technologies of collaborative tagging, and the paradigmatic
character it acquired makes it a natural starting point for any quantitative study; ii)
because of its popularity, it has a large community of active users and comprises a precious
body of raw data on the static and dynamical properties of a folksonomy; iii) it is a broad
folksonomy [8], and single tagging events (posts) retain their identity and can be
individually retrieved. This affords unimpeded access to the "microscopic" dynamics of
collaborative tagging, providing the opportunity to make contact between emergent behaviors
and low-level dynamics. It also allows one to define and measure the multiplicity (or
frequency) of tags in the context of a single resource. By contrast, popular sites falling
in the narrow folksonomy class (Flickr, for example) foster a different model of user
interaction, where tags are mostly applied by the content creator, no notion of tag
multiplicity is available in the context of a resource, and no access is given to the raw
sequence of tagging events.

TABLE I: Statistics of the datasets used for the co-occurrence analysis. For each tag in the
first column we report the number of posts marked with that tag, the number of total and
distinct tags co-occurring with it, and the corresponding number of resources. The data were
retrieved during May 2005.

Tag     No. posts   No. tags   No. distinct tags   No. resources
Blog    37974       124171     10617               16990
Ajax    33140       108181     4141                2995
Xml     24249       108013     6035                7364
H5N1    981         5185       241                 969

In studying del.icio.us we adopt a tag-centric view of the system, that is, we investigate
the evolving relationship between a given tag and the set of tags that co-occur with it. In
line with our focus on semiotic dynamics, we factor out the detailed identity of the users
involved in the process, and only deal with streams of tagging events and their statistical
properties. To automate the collection of raw data we use a custom web (HTTP) client that
connects to del.icio.us and navigates the system's interface as an ordinary user would do,
extracting the relevant metadata and storing it for further post-processing. del.icio.us
allows the user to browse its content by tag: our client requests the web page associated
with the tag under study and uses an HTML parser to extract the post information (user,
resource, tags, time stamp) from the returned HTML code. Fig. 2 graphically depicts the raw
data we gather, for the case of two popular tags on del.icio.us. Table I describes the
datasets we used for the present analysis.



3. DATA ANALYSIS 

Here we analyze data from del.icio.us and Connotea and investigate the statistical
properties of tag association. Specifically, we select a semantic context by extracting the
resources associated with a given tag X and study the statistical distribution of tags
co-occurring with X (see Table I). Fig. 2 graphically illustrates the associations between
tags and posts, and Fig. 3 reports the frequency-rank distributions for the tags
co-occurring with a few selected ones. The high-rank tail of the experimental curves
displays a power-law behavior, signature of an emergent hierarchical structure,
corresponding to a generalized Zipf's law [9] with an exponent between 1 and 2. Since power
laws are the standard signature of self-organization and of human activity [10, 11], the
presence of a power-law tail is not surprising. The observed value of the exponent, however,
deserves further investigation because the mechanisms usually invoked to explain Zipf's law
and its generalizations do not look very realistic for the case at hand, and a mechanism
grounded on experimental data should be sought.
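
The frequency-rank construction used in this section can be sketched as follows. This is a minimal Python illustration and not the code used for the analysis: posts are reduced to plain sets of tags, the function name is hypothetical, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter


def cooccurrence_frequency_rank(posts, tag_x):
    """Return (rank, tag, count) for tags co-occurring with tag_x, most frequent first."""
    counts = Counter()
    for tags in posts:
        if tag_x in tags:
            # Count every other tag that appears in the same post as tag_x.
            counts.update(t for t in tags if t != tag_x)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, tag, n) for rank, (tag, n) in enumerate(ranked, start=1)]


# Hypothetical posts, each reduced to its set of tags.
posts = [
    {"blog", "design", "css"},
    {"blog", "design"},
    {"blog", "news"},
    {"ajax", "javascript"},
]
for rank, tag, n in cooccurrence_frequency_rank(posts, "blog"):
    print(rank, tag, n)
```

Plotting the counts against rank on doubly logarithmic axes gives the kind of curve discussed above, where rank 1 is the most frequent co-occurring tag.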
FIG. 2: Tagging activity: a time-ordered sequence of tagging events is graphically rendered by marking the tags co-occurring
with blog (top panel) or ajax (bottom panel) in an experimental sequence of posts on del.icio.us. In each panel, columns
represent single tagging events (posts) and rows correspond to the 10 most frequent tags co-occurring with either blog (top
panel) or ajax (bottom panel). 100 tagging events are shown in each panel, temporally ordered from left to right. Only posts
involving at least one of the 10 top-ranked tags are shown. For each tagging event (column), a filled cell marks the presence of
the tag in the corresponding row, while an empty cell indicates its absence. A qualitative difference between blog (top panel)
and ajax (bottom panel) is clearly visible, where a higher density at low-rank tags characterizes the semantically narrower ajax
term. This corresponds to the steeper low-rank behavior observed in the frequency-rank plot for ajax (Fig. 3).



Moreover, the low-rank part of the frequency-rank curves exhibits a flattening typically not
observed in systems strictly obeying Zipf's law. Several aspects of the underlying complex
dynamics may be responsible for this feature. On the one hand, this behavior points to the
existence of semantically equivalent and possibly competing high-frequency tags (e.g. blog
and blogs). More importantly, this flattening behavior may be ascribed to an underlying
hierarchical organization of tags co-occurring with the one we single out: more general tags
(semantically speaking) will tend to co-occur with a larger number of other tags. In this
scenario, we expect a shallower behavior for tags co-occurring with generic tags (e.g. blog)
and a steeper behavior for semantically narrow tags (e.g. ajax, see also Fig. 2). To better
probe the validity of this interpretation, we investigate the co-occurrence relationship
that links high-rank tags, lying well within the power-law tail, with low-rank tags located
in the shallow part of the distribution. Our observations (see section 5) point in the
direction of a non-trivial hierarchical organization emerging out of the collective tagging
activity, with each low-rank tag leading its own hierarchy of semantically related
higher-rank tags, and all such hierarchies merging into the overall power-law tail.



4. A YULE-SIMON'S MODEL WITH 
LONG-TERM MEMORY 

We now aim at gaining a deeper insight into the phenomenology reported above. In order to
model the observed frequency-rank behavior for the full range of ranking values, we
introduce a new version of the "rich-get-richer" Yule-Simon's stochastic model by enhancing
it with a fat-tailed memory kernel. The original model can be described as the construction
of a text from scratch. At each discrete time step one word is appended to the text: with
probability $p$ the appended word is a new word that has never occurred before, while with
probability $1 - p$ one word is copied from the existing text, chosen with a probability
proportional to its current frequency of occurrence. This simple process yields
frequency-rank distributions that display a power-law tail with exponent $\alpha = 1 - p$,
lower than the exponents we observe in actual data. This happens because the Yule-Simon
process has no notion of "aging", i.e. all positions within the text are regarded as
identical.
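
For concreteness, the original Yule-Simon process described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, words are represented by integer ids, and starting the text from a single word is an assumption made here for simplicity.

```python
import random
from collections import Counter


def yule_simon_text(steps, p, seed=None):
    """Simulate the original Yule-Simon process; words are integer ids."""
    rng = random.Random(seed)
    text = [0]  # assumption: the text starts with a single word
    next_word = 1
    for _ in range(steps):
        if rng.random() < p:
            # With probability p, append a brand new word.
            text.append(next_word)
            next_word += 1
        else:
            # A uniformly chosen position selects a word with probability
            # proportional to its current frequency of occurrence.
            text.append(rng.choice(text))
    return text


text = yule_simon_text(steps=20000, p=0.05, seed=1)
frequencies = sorted(Counter(text).values(), reverse=True)
print(len(frequencies), frequencies[:5])
```

Copying from a uniformly chosen position is what makes all positions in the text equivalent, which is the absence of aging noted above.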

Our construction starts from the observation that actual users are in principle exposed to
all the tags stored in the system (as in the original Yule-Simon model), but the way in
which they choose among them, when tagging a new resource, is far from being uniform in
time. It seems more realistic to assume that users tend to apply recently added tags more
frequently than old ones, according to a memory kernel which might be highly skewed. Indeed,
recent findings about human activities [11] support the idea that the access pattern to the
past of the system should be fat-tailed, suggesting a power-law memory kernel.

FIG. 3: Frequency-rank plots for tags co-occurring with a selected tag: experimental data (black symbols) are shown for
del.icio.us (circles for tags co-occurring with the popular tag blog, squares for ajax and triangles for xml) and Connotea (inset,
black circles for the H5N1 tag). For the sake of clarity, the curves for ajax and xml are shifted down by one and two decades,
respectively. Details about the experimental datasets are reported in Table I. All curves exhibit a power-law decay for high
ranks (a dashed line corresponding to the power law $R^{-5/4}$ is provided as a guide for the eye) and a shallower behavior for low
ranks. To make contact with Fig. 5, some of the highest-frequency tags co-occurring with blog and ajax are explicitly indicated
with arrows. Red symbols are theoretical data obtained by computer simulation of the stochastic process described in the text
(Fig. 5). The parameters of the model, i.e. the probability $p$, the memory parameter $r$ and the initial number of words $n_0$,
were adjusted to match the experimental data, giving approximately $p = 0.06$, $r = 100$ and $n_0 = 100$ for blog, $p = 0.03$, $r = 20$
and $n_0 = 50$ for ajax, and $p = 0.034$, $r = 40$ and $n_0 = 110$ for xml. Inset: Connotea is a much younger system than del.icio.us
and the corresponding dataset is smaller and noisier. Nevertheless, a good match with experimental data can be obtained for
$p = 0.05$, $r = 120$ and $n_0 = 7$ (red circles), demonstrating that our model also applies to the early stages of development of a
folksonomy. Gray circles correspond to different realizations of the simulated dynamics.

We tested this hypothesis with real data extracted from del.icio.us: Fig. 4 shows the
temporal autocorrelation function for the sequence of tags co-occurring with blog. Such a
sequence is constructed by consecutively appending the tags associated with each post,
respecting the temporal order of posts. Correlations are computed inside three consecutive
windows of length $T$, starting at different times $t_w$,

$$C(\Delta t, t_w) = \frac{1}{T - \Delta t} \sum_{t = t_w + 1}^{t_w + T - \Delta t} \delta(\mathrm{tag}(t + \Delta t), \mathrm{tag}(t)),$$

where $\delta(\mathrm{tag}(t + \Delta t), \mathrm{tag}(t))$ is the usual Kronecker delta
function, taking the value 1 when the same tag occurs at times $t$ and $t + \Delta t$. From
Fig. 4 it is apparent that the correlation function is non-stationary over time. Moreover,
for each value of the initial time $t_w$ a power-law behavior is observed:
$C(\Delta t, t_w) = a(t_w)/(\Delta t + S(t_w)) + c(t_w)$, where $a(t_w)$ is a time-dependent
normalization factor and $S(t_w)$ is a phenomenological time scale, slowly increasing with
the "age" $t_w$ of the system. $c(t_w)$ is the correlation that one would expect in a random
sequence of tags distributed according to the frequency-rank distribution $P_{T,t_w}(R)$
pertaining to the relevant data window. Denoting by $R_{\max}(T, t_w)$ the number of
distinct tags occurring in the window $[t_w, t_w + T]$, we have

$$c(t_w) = \sum_{R=1}^{R_{\max}(T, t_w)} P_{T,t_w}^2(R).$$
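
The windowed tag-tag correlation $C(\Delta t, t_w)$ and the random-sequence baseline $c(t_w)$ discussed in this section can be computed along the following lines. This Python sketch is illustrative only: the function names are hypothetical, the window is taken as a 0-based slice of the stream (an indexing convention assumed here), and the baseline is computed as the chance that two independent draws from the window's tag frequencies coincide.

```python
from collections import Counter


def tag_correlation(stream, t_w, T, dt):
    """C(dt, t_w): fraction of positions in the window whose tag recurs dt steps later."""
    window = stream[t_w:t_w + T]  # assumption: 0-based window of length T
    if not 0 < dt < len(window):
        raise ValueError("dt must satisfy 0 < dt < window length")
    pairs = len(window) - dt  # number of terms in the sum, T - dt
    matches = sum(window[t] == window[t + dt] for t in range(pairs))
    return matches / pairs


def random_baseline(stream, t_w, T):
    """c(t_w): sum of squared tag probabilities within the window."""
    window = stream[t_w:t_w + T]
    n = len(window)
    return sum((count / n) ** 2 for count in Counter(window).values())


stream = ["a", "b", "a", "b", "a", "c", "a", "b"]  # hypothetical tag sequence
print(tag_correlation(stream, t_w=0, T=8, dt=2))
print(random_baseline(stream, t_w=0, T=8))
```

Evaluating `tag_correlation` for a range of `dt` values in several windows, and comparing the large-`dt` values with `random_baseline`, mirrors the comparison between measured plateaus and expected long-range correlations.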

Our modification of the Yule-Simon's model thus consists in weighting the probability of
choosing an existing word (tag) according to a power-law kernel. This hypothesis about the
functional form of the memory kernel is also supported by findings in Cognitive Psychology
[17], where power laws of latency and frequency have been shown to model human memory.

FIG. 4: Tag-tag correlation functions and non-stationarity. The tag-tag correlation function $C(\Delta t, t_w)$ is computed over three
consecutive and equally long ($T = 30000$ tags each) subsets of the blog dataset, starting respectively at positions $t_w^1 = 10000$,
$t_w^2 = 40000$ and $t_w^3 = 70000$ within the collected sequence. Short-range correlations are clearly visible, slowly decaying towards
a long-range plateau value. The non-stationary character of correlations is visible both at short range, where the value of the
correlation function decays with $t_w$, and at long range, where the asymptotic correlation increases with $t_w$. The long-range
correlations (dashed lines) can be estimated as the natural correlation present in a random sequence containing a finite number
of tags: on using the appropriate ranked distribution of tag frequencies within each window (see text) the values $c(t_w^1)$, $c(t_w^2)$
and $c(t_w^3)$ can be computed, matching the measured plateau of the correlation functions. The thick line is a fit to the fat-tailed
memory kernel described in the text.

Summarizing, our model of users' behavior can be stated as follows: the process by which
users of a collaborative tagging system associate tags to resources can be regarded as the
construction of a "text", built one step at a time by adding "words" (i.e. tags) to a text
initially comprised of $n_0$ words. This process is meant to model the behavior of an
effective average user in the context identified by a specific tag. At a generic (discrete)
time step $t$, a brand new word may be invented with probability $p$ and appended to the
text, while with probability $1 - p$ one word is copied from the existing text, going back
in time by $x$ steps with a probability $Q_t(x)$ that decays as a power law,
$Q_t(x) = a(t)/(x + r)$. Here $a(t)$ is a normalization factor and $r$ is a characteristic
time scale over which recently added words have comparable probabilities. Fig. 3 shows the
excellent agreement between the experimental data and the numerical predictions of our
Yule-Simon's model with long-term memory. Our model, unsurprisingly, also reproduces the
temporal correlation behavior observed in real data.
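
A minimal simulation of the process with long-term memory might look as follows. This Python sketch is not the simulation code behind the reported results: the function name is hypothetical, the $n_0$ initial words are assumed to be distinct, and precomputing cumulative kernel weights is simply one convenient way to sample $x$ with probability proportional to $1/(x + r)$. The parameter values in the example are the approximate ones quoted for blog in the caption of Fig. 3.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate


def simulate_tag_stream(steps, p, r, n0, seed=None):
    """Yule-Simon process where copies go x steps back with weight 1 / (x + r)."""
    rng = random.Random(seed)
    stream = list(range(n0))  # assumption: the n0 initial words are all distinct
    next_new = n0
    # cum[k - 1] = sum over x = 1..k of 1 / (x + r), reused at every time step.
    cum = list(accumulate(1.0 / (x + r) for x in range(1, n0 + steps + 1)))
    for _ in range(steps):
        if rng.random() < p:
            stream.append(next_new)  # invent a brand new word
            next_new += 1
        else:
            length = len(stream)
            # Normalizing by cum[length - 1] plays the role of a(t).
            u = rng.random() * cum[length - 1]
            x = bisect.bisect_right(cum, u, 0, length) + 1
            x = min(x, length)  # guard against floating-point edge cases
            stream.append(stream[-x])  # copy the word found x steps back
    return stream


stream = simulate_tag_stream(steps=20000, p=0.06, r=100, n0=100, seed=1)
frequencies = sorted(Counter(stream).values(), reverse=True)
print(len(frequencies), frequencies[:5])
```

Sorting the word counts in decreasing order yields a simulated frequency-rank curve that can be compared with the experimental ones.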

The interpretation of $r$ (similar to that of the $S$ parameter introduced above for tag-tag
correlations) is related to the number of equivalent top-ranked tags perceived by users as
semantically independent (see section 5). In our model, in fact, the average user is exposed
to a few roughly equivalent top-ranked tags and this is translated mathematically into a
low-rank cutoff of the power law, i.e. the observed low-rank flattening.

Fitting the parameters of the model, in order to match its predictions (obtained by computer
simulation) against the experimental data, we obtain an excellent agreement for all the
frequency-rank curves we measured, as shown in Fig. 3. This is a clear indication that the
tagging behavior embodied in our simple model captures some key features of the tagging
activity. The parameter $r$ controls the number of top-ranked tags which are allowed to
co-occur with comparable frequencies, so that it can be interpreted as a measure of the
"semantic breadth" of a tag. This picture is consistent with the fact that the fitted value
of $r$ obtained for blog (a rather generic tag) is larger than the one needed for ajax (a
rather specific one). Additional information on the role of $r$ and of $p$ in the framework
of our model is reported elsewhere.
FIG. 5: A Yule-Simon's process with long-term memory. A synthetic stream of tags is generated
by iterating the following step: with probability $p$ a new tag is created and appended to
the stream, while with probability $1 - p$ a tag is copied from the past of the stream and
appended to it. The probability of selecting a tag located $x$ steps into the past is given
by the long-range memory kernel $Q_t(x)$, which provides a fat-tailed access to the past of
the stream.



5. CO-OCCURRENCE BETWEEN HIGH-RANK 
AND LOW-RANK TAGS 

Fig. 6 shows a table where the occurrence of 30 high-rank (low-frequency) tags is related to
the occurrence of the 15 lowest-rank (highest-frequency) tags. All the tags under study
co-occur with the tag blog, and the dataset used for the analysis is the same as the one
used in Fig. 2. The co-occurrence analysis is performed as follows: given a high-rank tag X,
all resources tagged with X (within the above dataset) are selected, and the co-occurrence
frequencies of X with each of the 15 top-ranked (most frequent) tags are recorded. Thus,
each row of the table associates a tag X with the corresponding (normalized) co-occurrence
histogram. This provides a statistical characterization of tag X in terms of the top-ranked
tags, regarded as a natural basis for categorization (or semantic "grounding"). Fig. 7
graphically illustrates such a "tag fingerprint" for 5 high-rank tags, arbitrarily chosen.
This analysis is aimed at probing the existence of non-trivial co-occurrence relationships
that might be ascribed to semantics and, possibly, to the emergence of a self-organized
hierarchy of tags. As shown by the bold numbers in Fig. 6 as well as by the graph in Fig. 7,
high-frequency (low-rank) tags do not trivially co-occur with most of the low-frequency
(high-rank) tags; on the contrary, the co-occurrence profile of the latter is peaked at
specific, semantically related tags (economics and law with politics, for example, see
Fig. 7). Moreover, several low-frequency (high-rank) tags never co-occur with some of the
highest-frequency (low-rank) tags, as shown by the several zeros in Fig. 6. This suggests
that high-frequency tags partition, or "categorize", the resources marked by tags of lower
frequency. Given that our definitions of "high-rank" and "low-rank" are somewhat arbitrary,
and given the self-similar character of tag association we observed (Fig. 3), we expect our
observations to be representative of a general and complex semiotic structure underlying
folksonomies.
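
The "tag fingerprint" described in this section can be sketched as a normalized co-occurrence histogram. The Python example below is a hedged illustration: the function name and the toy posts are hypothetical, and normalizing each row so that it sums to 1 is an assumption, since the normalization is not spelled out in the text.

```python
from collections import Counter


def tag_fingerprint(posts, tag_x, basis_tags):
    """Normalized histogram of co-occurrences of tag_x with each basis tag."""
    counts = Counter()
    for tags in posts:
        if tag_x in tags:
            counts.update(b for b in basis_tags if b in tags)
    total = sum(counts.values())
    # Assumption: the row is normalized to sum to 1 (all zeros if no co-occurrence).
    return [counts[b] / total if total else 0.0 for b in basis_tags]


# Hypothetical posts, each reduced to its set of tags.
posts = [
    {"blog", "economics", "politics"},
    {"blog", "economics", "politics", "news"},
    {"blog", "economics", "business"},
    {"blog", "python", "programming"},
]
basis = ["politics", "news", "programming", "business"]
print(tag_fingerprint(posts, "economics", basis))
print(tag_fingerprint(posts, "python", basis))
```

In this toy example the two fingerprints peak on different basis tags and contain zeros elsewhere, which is the qualitative pattern the co-occurrence table is meant to reveal.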



6. CONCLUSIONS 

Uncovering the mechanisms governing the emergence 
of shared categorizations or vocabularies in the absence of
global coordination is a key problem with significant sci- 
entific and technological potential. Collaborative tag- 
ging provides a precious opportunity to both analyze the 
emergence of shared conventions and inspire the design of 
large (human or artificial) agent systems. Here we report 
a statistical analysis of tagging activity in a popular so- 
cial bookmarking system, and introduce a simple stochas- 
tic model of user behavior which is able to reproduce the 
measured co-occurrence properties to a surprising level
of accuracy. Our results suggest that users of collabo- 
rative tagging systems share universal behaviors which, 
despite the intricacies of personal categorization, tagging 
procedures and user interactions, appear to follow sim- 
ple activity patterns. In addition to the findings reported 
and discussed in this paper, our approach constitutes a 
starting point upon which studies of greater complexity 
can be based, with the final goal of understanding, pre- 
dicting and controlling the Semiotic Dynamics of online 
social systems. 
FIG. 6: Co-occurrence table: columns correspond to the 15 top-ranked tags co-occurring with blog, in descending order of
frequency from left to right. Rows correspond to 30 low-frequency tags co-occurring with blog (frequencies ranking between
100th and 200th). Each row is a normalized co-occurrence histogram representing a "categorization" of the corresponding tag
in terms of the top-ranked tags. Numbers in red (bold face) denote co-occurrence probabilities in excess of 25%. Zeros (no
co-occurrence) are marked in blue (bold face).



FIG. 7: Co-occurrence patterns for 5 of the low-frequency (high-rank) tags of Fig. 6 (legend: economics, law, typography,
socialsoftware, python). The colored bars display the "fingerprint" of the selected tags in terms of their co-occurrence with
the 15 top-ranked tags (the same ones reported in the top row of Fig. 6).